Analysis & Implementation of Weather Forecasting using ML

Dataset statistics

Number of variables: 10
Number of observations: 96432
Missing cells: 0
Missing cells (%): 0.0%
Duplicate rows: 0
Duplicate rows (%): 0.0%
Total size in memory: 7.4 MiB
Average record size in memory: 80.0 B
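The two memory figures above are consistent with each other. The following minimal sketch reproduces them, assuming (this is not stated in the report) that each of the 10 columns is stored as a 64-bit value, i.e. 8 bytes per cell.

```python
# Figures copied from the "Dataset statistics" block above.
n_variables = 10
n_observations = 96432
bytes_per_value = 8  # assumption: every column is stored as a 64-bit value

# Size of one row, then of the whole table in MiB (1 MiB = 2**20 bytes).
record_size_bytes = n_variables * bytes_per_value
total_mib = n_observations * record_size_bytes / 2**20

print(f"record size: {record_size_bytes} B")
print(f"total size: {total_mib:.1f} MiB")
```

Under that assumption the record size comes out at 80 B and the total rounds to 7.4 MiB, matching the reported values.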

Variable types

DateTime: 1
Numeric: 9

Alerts

maxtempC is highly correlated with mintempC and 4 other fields (High correlation)
mintempC is highly correlated with maxtempC and 3 other fields (High correlation)
cloudcover is highly correlated with maxtempC and 2 other fields (High correlation)
humidity is highly correlated with maxtempC and 2 other fields (High correlation)
sunHour is highly correlated with maxtempC and 2 other fields (High correlation)
HeatIndexC is highly correlated with maxtempC and 3 other fields (High correlation)
pressure is highly correlated with mintempC and 1 other field (High correlation)
date_time has unique values (Unique)
cloudcover has 6533 (6.8%) zeros (Zeros)
precipMM has 84604 (87.7%) zeros (Zeros)
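The percentages in the two zero-value alerts are the zero counts divided by the number of observations. A short check, using only figures that appear in this report:

```python
n_observations = 96432
zero_counts = {"cloudcover": 6533, "precipMM": 84604}  # from the alerts above

# Share of zero-valued cells per column, in percent.
zero_share = {
    name: 100 * count / n_observations for name, count in zero_counts.items()
}

for name, pct in zero_share.items():
    print(f"{name}: {pct:.1f}% zeros")
```

The precipMM share matters for modelling: with close to nine in ten hours recording no rain, the column behaves more like a rare-event indicator than a continuous quantity.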

Reproduction

Analysis started: 2022-10-14 16:31:22.146809
Analysis finished: 2022-10-14 16:31:50.006657
Duration: 27.86 seconds
Software version: pandas-profiling v3.3.0
Download configuration: config.json

Variables

date_time
Date

UNIQUE

Distinct: 96432
Distinct (%): 100.0%
Missing: 0
Missing (%): 0.0%
Memory size: 753.5 KiB
Minimum: 2009-01-01 00:00:00
Maximum: 2020-01-01 23:00:00
Histogram with fixed size bins (bins=50)
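The date_time range and the distinct count fit a gap-free hourly series. The sketch below counts the hourly timestamps between the reported minimum and maximum; the one-hour step is an assumption, inferred from the sample rows rather than stated in the report.

```python
from datetime import datetime, timedelta

start = datetime(2009, 1, 1, 0, 0, 0)   # reported minimum
end = datetime(2020, 1, 1, 23, 0, 0)    # reported maximum
step = timedelta(hours=1)               # assumption: strictly hourly spacing

# Number of timestamps in the closed interval [start, end] at that spacing.
expected_rows = (end - start) // step + 1
print(expected_rows)
```

The result equals the 96432 distinct values reported, which, together with the zero missing cells, suggests there are no skipped or repeated hours.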

maxtempC
Real number (ℝ≥0)

HIGH CORRELATION

Distinct: 23
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 29.64609258
Minimum: 18
Maximum: 40
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 753.5 KiB

Quantile statistics

Minimum: 18
5-th percentile: 25
Q1: 27
Median: 29
Q3: 32
95-th percentile: 36
Maximum: 40
Range: 22
Interquartile range (IQR): 5

Descriptive statistics

Standard deviation: 3.44642703
Coefficient of variation (CV): 0.1162523196
Kurtosis: -0.139291169
Mean: 29.64609258
Median Absolute Deviation (MAD): 2
Skewness: 0.4274565427
Sum: 2858832
Variance: 11.87785927
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=23)
Common values

Value              Count  Frequency (%)
28                 14976  15.5%
29                 13296  13.8%
27                 11160  11.6%
30                 9384   9.7%
26                 7968   8.3%
31                 6456   6.7%
34                 5088   5.3%
35                 5016   5.2%
33                 4896   5.1%
32                 4392   4.6%
Other values (13)  13800  14.3%

Minimum 10 values

Value              Count  Frequency (%)
18                 24     < 0.1%
19                 72     0.1%
20                 96     0.1%
21                 216    0.2%
22                 816    0.8%
23                 600    0.6%
24                 1776   1.8%
25                 3792   3.9%
26                 7968   8.3%
27                 11160  11.6%

Maximum 10 values

Value              Count  Frequency (%)
40                 240    0.2%
39                 336    0.3%
38                 672    0.7%
37                 1680   1.7%
36                 3480   3.6%
35                 5016   5.2%
34                 5088   5.3%
33                 4896   5.1%
32                 4392   4.6%
31                 6456   6.7%
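Several of the maxtempC statistics are derived from the others: the range and IQR are differences of quantiles, the coefficient of variation is the standard deviation divided by the mean, the variance is the squared standard deviation, and the mean is the sum divided by the number of observations. A small sketch that recomputes them from the reported figures:

```python
# Values copied from the maxtempC statistics in this report.
minimum, maximum = 18, 40
q1, q3 = 27, 32
mean = 29.64609258
std = 3.44642703
total = 2858832
n_observations = 96432

value_range = maximum - minimum          # Range
iqr = q3 - q1                            # Interquartile range
cv = std / mean                          # Coefficient of variation
variance = std ** 2                      # Variance
mean_from_sum = total / n_observations   # Mean recovered from Sum

print(value_range, iqr)
print(f"CV={cv:.6f} variance={variance:.5f} mean={mean_from_sum:.6f}")
```

The same relationships hold for every numeric variable in the report, so they can serve as a quick sanity check after the data are reloaded or transformed.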

mintempC
Real number (ℝ≥0)

HIGH CORRELATION

Distinct: 18
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 19.33673469
Minimum: 11
Maximum: 28
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 753.5 KiB

Quantile statistics

Minimum: 11
5-th percentile: 14
Q1: 18
Median: 20
Q3: 21
95-th percentile: 24
Maximum: 28
Range: 17
Interquartile range (IQR): 3

Descriptive statistics

Standard deviation: 2.77377135
Coefficient of variation (CV): 0.1434456951
Kurtosis: -0.05055967691
Mean: 19.33673469
Median Absolute Deviation (MAD): 2
Skewness: -0.2244711172
Sum: 1864680
Variance: 7.6938075
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=18)
Common values

Value              Count  Frequency (%)
20                 18384  19.1%
19                 14424  15.0%
21                 12360  12.8%
18                 9240   9.6%
22                 8376   8.7%
17                 6864   7.1%
23                 5304   5.5%
16                 5232   5.4%
15                 4896   5.1%
14                 3312   3.4%
Other values (8)   8040   8.3%

Minimum 10 values

Value              Count  Frequency (%)
11                 48     < 0.1%
12                 504    0.5%
13                 1776   1.8%
14                 3312   3.4%
15                 4896   5.1%
16                 5232   5.4%
17                 6864   7.1%
18                 9240   9.6%
19                 14424  15.0%
20                 18384  19.1%

Maximum 10 values

Value              Count  Frequency (%)
28                 48     < 0.1%
27                 240    0.2%
26                 552    0.6%
25                 1824   1.9%
24                 3048   3.2%
23                 5304   5.5%
22                 8376   8.7%
21                 12360  12.8%
20                 18384  19.1%
19                 14424  15.0%

cloudcover
Real number (ℝ≥0)

HIGH CORRELATION
ZEROS

Distinct: 101
Distinct (%): 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 34.84748839
Minimum: 0
Maximum: 100
Zeros: 6533
Zeros (%): 6.8%
Negative: 0
Negative (%): 0.0%
Memory size: 753.5 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 9
Median: 29
Q3: 54
95-th percentile: 90
Maximum: 100
Range: 100
Interquartile range (IQR): 45

Descriptive statistics

Standard deviation: 28.39102052
Coefficient of variation (CV): 0.814722146
Kurtosis: -0.6143135663
Mean: 34.84748839
Median Absolute Deviation (MAD): 21
Skewness: 0.6347253774
Sum: 3360413
Variance: 806.0500463
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
Common values

Value              Count  Frequency (%)
0                  6533   6.8%
100                2868   3.0%
4                  2457   2.5%
5                  2320   2.4%
3                  2258   2.3%
6                  2131   2.2%
7                  1900   2.0%
2                  1810   1.9%
8                  1682   1.7%
9                  1621   1.7%
Other values (91)  70852  73.5%

Minimum 10 values

Value              Count  Frequency (%)
0                  6533   6.8%
1                  1449   1.5%
2                  1810   1.9%
3                  2258   2.3%
4                  2457   2.5%
5                  2320   2.4%
6                  2131   2.2%
7                  1900   2.0%
8                  1682   1.7%
9                  1621   1.7%

Maximum 10 values

Value              Count  Frequency (%)
100                2868   3.0%
99                 174    0.2%
98                 188    0.2%
97                 181    0.2%
96                 219    0.2%
95                 221    0.2%
94                 208    0.2%
93                 240    0.2%
92                 254    0.3%
91                 264    0.3%

humidity
Real number (ℝ≥0)

HIGH CORRELATION

Distinct: 95
Distinct (%): 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 64.89546001
Minimum: 6
Maximum: 100
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 753.5 KiB

Quantile statistics

Minimum: 6
5-th percentile: 25
Q1: 49
Median: 68
Q3: 83
95-th percentile: 95
Maximum: 100
Range: 94
Interquartile range (IQR): 34

Descriptive statistics

Standard deviation: 21.8568693
Coefficient of variation (CV): 0.3368012077
Kurtosis: -0.7671365746
Mean: 64.89546001
Median Absolute Deviation (MAD): 17
Skewness: -0.4261100313
Sum: 6257999
Variance: 477.7227358
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
Common values

Value              Count  Frequency (%)
83                 1757   1.8%
86                 1754   1.8%
88                 1751   1.8%
85                 1739   1.8%
84                 1705   1.8%
82                 1681   1.7%
87                 1677   1.7%
80                 1670   1.7%
79                 1643   1.7%
81                 1639   1.7%
Other values (85)  79416  82.4%

Minimum 10 values

Value              Count  Frequency (%)
6                  1      < 0.1%
7                  3      < 0.1%
8                  9      < 0.1%
9                  26     < 0.1%
10                 58     0.1%
11                 60     0.1%
12                 122    0.1%
13                 140    0.1%
14                 156    0.2%
15                 226    0.2%

Maximum 10 values

Value              Count  Frequency (%)
100                117    0.1%
99                 697    0.7%
98                 1315   1.4%
97                 1320   1.4%
96                 1244   1.3%
95                 1297   1.3%
94                 1332   1.4%
93                 1390   1.4%
92                 1433   1.5%
91                 1541   1.6%

sunHour
Real number (ℝ≥0)

HIGH CORRELATION

Distinct: 33
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 10.65348432
Minimum: 4.2
Maximum: 12.9
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 753.5 KiB

Quantile statistics

Minimum: 4.2
5-th percentile: 6.2
Q1: 8.8
Median: 11.6
Q3: 11.6
95-th percentile: 12.9
Maximum: 12.9
Range: 8.7
Interquartile range (IQR): 2.8

Descriptive statistics

Standard deviation: 1.986738054
Coefficient of variation (CV): 0.1864871618
Kurtosis: 0.9383413886
Mean: 10.65348432
Median Absolute Deviation (MAD): 1.2
Skewness: -1.19998749
Sum: 1027336.8
Variance: 3.947128094
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=33)
Common values

Value              Count  Frequency (%)
11.6               42792  44.4%
8.7                12120  12.6%
12.9               10320  10.7%
10.2               6048   6.3%
12.8               4512   4.7%
7.2                2400   2.5%
8.8                2112   2.2%
8.9                2064   2.1%
11.9               1824   1.9%
10.3               1656   1.7%
Other values (23)  10584  11.0%

Minimum 10 values

Value              Count  Frequency (%)
4.2                936    1.0%
4.3                840    0.9%
5.7                1080   1.1%
5.8                528    0.5%
5.9                72     0.1%
6                  408    0.4%
6.1                576    0.6%
6.2                840    0.9%
7.2                2400   2.5%
7.3                120    0.1%

Maximum 10 values

Value              Count  Frequency (%)
12.9               10320  10.7%
12.8               4512   4.7%
12.5               720    0.7%
12.3               528    0.5%
11.9               1824   1.9%
11.8               216    0.2%
11.6               42792  44.4%
10.7               120    0.1%
10.6               1032   1.1%
10.5               48     < 0.1%

HeatIndexC
Real number (ℝ≥0)

HIGH CORRELATION

Distinct: 31
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 25.26966152
Minimum: 13
Maximum: 43
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 753.5 KiB

Quantile statistics

Minimum: 13
5-th percentile: 18
Q1: 22
Median: 25
Q3: 28
95-th percentile: 33
Maximum: 43
Range: 30
Interquartile range (IQR): 6

Descriptive statistics

Standard deviation: 4.43081103
Coefficient of variation (CV): 0.1753411309
Kurtosis: -0.2225966062
Mean: 25.26966152
Median Absolute Deviation (MAD): 3
Skewness: 0.0183084165
Sum: 2436804
Variance: 19.63208638
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=31)
Common values

Value              Count  Frequency (%)
25                 12633  13.1%
26                 10512  10.9%
27                 9033   9.4%
28                 7642   7.9%
20                 6507   6.7%
29                 5939   6.2%
24                 5707   5.9%
19                 4545   4.7%
30                 4340   4.5%
21                 4294   4.5%
Other values (21)  25280  26.2%

Minimum 10 values

Value              Count  Frequency (%)
13                 15     < 0.1%
14                 114    0.1%
15                 442    0.5%
16                 1129   1.2%
17                 2050   2.1%
18                 2980   3.1%
19                 4545   4.7%
20                 6507   6.7%
21                 4294   4.5%
22                 3726   3.9%

Maximum 10 values

Value              Count  Frequency (%)
43                 4      < 0.1%
42                 11     < 0.1%
41                 19     < 0.1%
40                 33     < 0.1%
39                 54     0.1%
38                 139    0.1%
37                 245    0.3%
36                 477    0.5%
35                 884    0.9%
34                 1477   1.5%

precipMM
Real number (ℝ≥0)

ZEROS

Distinct: 87
Distinct (%): 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 0.07771901444
Minimum: 0
Maximum: 16.9
Zeros: 84604
Zeros (%): 87.7%
Negative: 0
Negative (%): 0.0%
Memory size: 753.5 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
Median: 0
Q3: 0
95-th percentile: 0.4
Maximum: 16.9
Range: 16.9
Interquartile range (IQR): 0

Descriptive statistics

Standard deviation: 0.3858652967
Coefficient of variation (CV): 4.96487635
Kurtosis: 224.7864486
Mean: 0.07771901444
Median Absolute Deviation (MAD): 0
Skewness: 11.19940752
Sum: 7494.6
Variance: 0.1488920272
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
Common values

Value              Count  Frequency (%)
0                  84604  87.7%
0.1                3677   3.8%
0.2                1750   1.8%
0.3                1069   1.1%
0.4                848    0.9%
0.5                647    0.7%
0.6                511    0.5%
0.7                385    0.4%
0.8                359    0.4%
0.9                293    0.3%
Other values (77)  2289   2.4%

Minimum 10 values

Value              Count  Frequency (%)
0                  84604  87.7%
0.1                3677   3.8%
0.2                1750   1.8%
0.3                1069   1.1%
0.4                848    0.9%
0.5                647    0.7%
0.6                511    0.5%
0.7                385    0.4%
0.8                359    0.4%
0.9                293    0.3%

Maximum 10 values

Value              Count  Frequency (%)
16.9               1      < 0.1%
16.4               1      < 0.1%
15.1               1      < 0.1%
12.7               1      < 0.1%
12.3               1      < 0.1%
11.3               1      < 0.1%
11.1               1      < 0.1%
10.1               1      < 0.1%
9.1                1      < 0.1%
8.8                1      < 0.1%

pressure
Real number (ℝ≥0)

HIGH CORRELATION

Distinct: 22
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 1010.554225
Minimum: 1000
Maximum: 1021
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 753.5 KiB

Quantile statistics

Minimum: 1000
5-th percentile: 1006
Q1: 1008
Median: 1010
Q3: 1013
95-th percentile: 1016
Maximum: 1021
Range: 21
Interquartile range (IQR): 5

Descriptive statistics

Standard deviation: 3.187015913
Coefficient of variation (CV): 0.00315373073
Kurtosis: -0.5051289699
Mean: 1010.554225
Median Absolute Deviation (MAD): 2
Skewness: 0.1600586231
Sum: 97449765
Variance: 10.15707043
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=22)
Common values

Value              Count  Frequency (%)
1009               11225  11.6%
1010               11021  11.4%
1008               10181  10.6%
1011               10128  10.5%
1012               9042   9.4%
1013               8236   8.5%
1007               8033   8.3%
1014               7165   7.4%
1015               5696   5.9%
1006               5004   5.2%
Other values (12)  10701  11.1%

Minimum 10 values

Value              Count  Frequency (%)
1000               1      < 0.1%
1001               24     < 0.1%
1002               76     0.1%
1003               345    0.4%
1004               1170   1.2%
1005               2700   2.8%
1006               5004   5.2%
1007               8033   8.3%
1008               10181  10.6%
1009               11225  11.6%

Maximum 10 values

Value              Count  Frequency (%)
1021               11     < 0.1%
1020               101    0.1%
1019               353    0.4%
1018               683    0.7%
1017               1662   1.7%
1016               3575   3.7%
1015               5696   5.9%
1014               7165   7.4%
1013               8236   8.5%
1012               9042   9.4%

windspeedKmph
Real number (ℝ≥0)

Distinct: 41
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 12.44893811
Minimum: 0
Maximum: 41
Zeros: 16
Zeros (%): < 0.1%
Negative: 0
Negative (%): 0.0%
Memory size: 753.5 KiB

Quantile statistics

Minimum: 0
5-th percentile: 4
Q1: 8
Median: 12
Q3: 16
95-th percentile: 23
Maximum: 41
Range: 41
Interquartile range (IQR): 8

Descriptive statistics

Standard deviation: 5.71676889
Coefficient of variation (CV): 0.4592173918
Kurtosis: 0.8143122374
Mean: 12.44893811
Median Absolute Deviation (MAD): 4
Skewness: 0.8047187624
Sum: 1200476
Variance: 32.68144654
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=41)
Common values

Value              Count  Frequency (%)
9                  7990   8.3%
12                 7897   8.2%
10                 7694   8.0%
11                 6783   7.0%
8                  6509   6.7%
13                 6443   6.7%
14                 5762   6.0%
15                 5332   5.5%
7                  4838   5.0%
6                  4644   4.8%
Other values (31)  32540  33.7%

Minimum 10 values

Value              Count  Frequency (%)
0                  16     < 0.1%
1                  294    0.3%
2                  681    0.7%
3                  1621   1.7%
4                  2312   2.4%
5                  3303   3.4%
6                  4644   4.8%
7                  4838   5.0%
8                  6509   6.7%
9                  7990   8.3%

Maximum 10 values

Value              Count  Frequency (%)
41                 1      < 0.1%
39                 9      < 0.1%
38                 10     < 0.1%
37                 21     < 0.1%
36                 40     < 0.1%
35                 71     0.1%
34                 52     0.1%
33                 94     0.1%
32                 136    0.1%
31                 216    0.2%

Interactions

Correlations

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better than Pearson's r at capturing nonlinear monotonic relationships. Its value lies between -1 and +1: -1 indicates total negative monotonic correlation, 0 indicates no monotonic correlation, and +1 indicates total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
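The definition above can be sketched in plain Python: rank both variables, then divide the covariance of the ranks by the product of their standard deviations. The helper names and the toy data are illustrative assumptions, not part of the report, and tied values are given the average of their rank positions, which is one common convention.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = (sum((a - mean_x) ** 2 for a in x) / n) ** 0.5
    std_y = (sum((b - mean_y) ** 2 for b in y) / n) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman's rho: Pearson's r applied to the rank variables."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (not from the dataset): y = x**3 is monotonic but nonlinear.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
rho = spearman(x, y)
r = pearson(x, y)
print(f"rho={rho:.4f} r={r:.4f}")
```

On this toy input ρ reaches 1 (up to floating-point rounding) because the relationship is perfectly monotonic, while r stays below 1 because it is not linear.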

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1: -1 indicates total negative linear correlation, 0 indicates no linear correlation, and +1 indicates total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that for a linear function the angle to the x-axis does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
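A minimal sketch of this calculation, with a check of the invariance property described above. The data are made up for illustration; shifting and positively rescaling either variable leaves r unchanged, while negating one variable flips its sign.

```python
def pearson(x, y):
    """Pearson's r: covariance over the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = (sum((a - mean_x) ** 2 for a in x) / n) ** 0.5
    std_y = (sum((b - mean_y) ** 2 for b in y) / n) ** 0.5
    return cov / (std_x * std_y)


# Toy data (not from the dataset): roughly y = 2x with small deviations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson(x, y)

# Separate changes of location and (positive) scale for each variable.
x_shifted = [3 * a + 10 for a in x]
y_shifted = [0.5 * b - 7 for b in y]
r_shifted = pearson(x_shifted, y_shifted)

# Negating one variable reverses the direction of the relationship.
r_flipped = pearson(x, [-b for b in y])

print(f"r={r:.4f} r_shifted={r_shifted:.4f} r_flipped={r_flipped:.4f}")
```

This is why columns measured in different units, such as pressure and windspeedKmph, can be compared directly without rescaling.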

Kendall's τ

Like Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1: -1 indicates total negative correlation, 0 indicates no correlation, and +1 indicates total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
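The pair-counting definition translates directly into code. The sketch below implements the formula exactly as stated (often called tau-a); library implementations commonly apply a tie-adjusted variant instead, so their results can differ on data with many repeated values. The function name and toy data are illustrative assumptions.

```python
from itertools import combinations


def kendall_tau(x, y):
    """(concordant - discordant) / total number of pairs, without tie correction."""
    concordant = discordant = 0
    pairs = list(combinations(zip(x, y), 2))
    for (x1, y1), (x2, y2) in pairs:
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1      # both variables move in the same direction
        elif product < 0:
            discordant += 1      # the variables move in opposite directions
        # product == 0 means a tie in x or y: counted in neither group
    return (concordant - discordant) / len(pairs)


# Toy data (not from the dataset): 7 concordant and 3 discordant pairs.
x = [1, 2, 3, 4, 5]
y = [3, 1, 2, 5, 4]
tau = kendall_tau(x, y)
print(tau)
```

Note that this loop is quadratic in the number of observations, so it is only practical as an illustration; on a dataset of this size a library routine would be the realistic choice.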

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency, and reverts to the Pearson correlation coefficient in the case of a bivariate normal input distribution. Extensive documentation is available from the Phik project.

Missing values

Count: a simple visualization of nullity by column.
Matrix: the nullity matrix is a data-dense display that lets you quickly pick out patterns in data completion.

Sample

First rows

       date_time            maxtempC  mintempC  cloudcover  humidity  sunHour  HeatIndexC  precipMM  pressure  windspeedKmph
0      2009-01-01 00:00:00  27        12        2           91        11.6     18          0.0       1014      8
1      2009-01-01 01:00:00  27        12        2           93        11.6     17          0.0       1014      6
2      2009-01-01 02:00:00  27        12        2           94        11.6     16          0.0       1014      4
3      2009-01-01 03:00:00  27        12        2           96        11.6     15          0.0       1014      3
4      2009-01-01 04:00:00  27        12        1           88        11.6     18          0.0       1015      3
5      2009-01-01 05:00:00  27        12        1           80        11.6     22          0.0       1016      3
6      2009-01-01 06:00:00  27        12        0           72        11.6     25          0.0       1016      4
7      2009-01-01 07:00:00  27        12        0           61        11.6     26          0.0       1016      5
8      2009-01-01 08:00:00  27        12        0           49        11.6     27          0.0       1015      6
9      2009-01-01 09:00:00  27        12        0           37        11.6     28          0.0       1014      7

Last rows

       date_time            maxtempC  mintempC  cloudcover  humidity  sunHour  HeatIndexC  precipMM  pressure  windspeedKmph
96422  2020-01-01 14:00:00  26        18        54          60        8.7      27          0.0       1013      17
96423  2020-01-01 15:00:00  26        18        63          61        8.7      27          0.0       1012      17
96424  2020-01-01 16:00:00  26        18        67          65        8.7      26          0.0       1013      16
96425  2020-01-01 17:00:00  26        18        71          68        8.7      26          0.2       1013      16
96426  2020-01-01 18:00:00  26        18        74          72        8.7      25          0.3       1014      15
96427  2020-01-01 19:00:00  26        18        74          76        8.7      25          0.1       1014      16
96428  2020-01-01 20:00:00  26        18        73          81        8.7      24          0.6       1015      16
96429  2020-01-01 21:00:00  26        18        72          86        8.7      23          0.8       1016      17
96430  2020-01-01 22:00:00  26        18        69          88        8.7      22          0.4       1016      16
96431  2020-01-01 23:00:00  26        18        66          89        8.7      21          0.5       1016      16